A method for correlating a peptide fragment mass spectrum with amino acid sequences derived from a database is provided. A
peptide is analyzed by a tandem mass spectrometer to yield a peptide fragment mass spectrum. A protein sequence database or a nucleotide 
sequence database is used to predict one or more fragment spectra for comparison with the experimentally-derived fragment spectrum. In
one embodiment, sub-sequences of the sequences found on the database which define a peptide having a mass substantially equal to the mass
of the peptide analyzed by the tandem mass spectrometer are identified as candidate sequences. For each candidate sequence, a plurality
of fragments of the sequence are identified and the masses and m/z ratios of the fragments are predicted and used to form a predicted
mass spectrum. The various predicted mass spectra are compared to the experimentally derived fragment spectrum using a closeness-of-fit
measure, preferably calculated with a two-step process, including a calculation of a preliminary score and, for the highest-scoring predicted
spectra, calculation of a correlation function.

IDENTIFICATION OF NUCLEOTIDES, AMINO ACIDS, OR CARBOHYDRATES BY MASS
SPECTROMETRY



Government Support
Certain aspects of this invention were made with
partial support under grant 8809710 from the National Science
Foundation and grant R01GM52095 from the National Institutes
of Health. The U.S. Government may have certain rights in
this invention.

Related Application 
The present application is a continuation-in-part 
of U.S. Serial No. 08/212,433, filed March 14, 1994, which is 
incorporated herein by reference. 

Background Of The Invention

A number of approaches have been used in the past 
for applying the analytic power of mass spectrometry to 
peptides. Tandem mass spectrometry (MS/MS) techniques have 
been particularly useful. In tandem mass spectrometry, the 
peptide or other input (commonly obtained from a
chromatography device) is applied to a first mass spectrometer
which serves to select, from a mixture of peptides, a target
peptide of a particular mass. The target peptide is then
activated or fragmented to produce a mixture of the "target"
or parent peptide and various component fragments, typically
peptides of smaller mass. This mixture is then transmitted to
a second mass spectrometer which records a fragment spectrum.
This fragment spectrum will typically be expressed in the form
of a bar graph having a plurality of peaks, each peak
indicating the mass-to-charge ratio (m/z) of a detected
fragment and having an intensity value.

Although the bare fragment spectrum can be of some
interest, it is often desired to use the fragment spectrum to
identify the peptide (or the parent protein) which resulted in
the fragment mixture. Previous approaches have typically
involved using the fragment spectrum as a basis for
hypothesizing one or more candidate amino acid sequences.
This procedure has typically involved human analysis by a
skilled researcher, although at least one automated procedure
has been described. John Yates, III, et al., "Computer Aided
Interpretation of Low Energy MS/MS Mass Spectra of Peptides,"
Techniques In Protein Chemistry II (1991), pp. 477-485,
incorporated herein by reference. The candidate sequences can
then be compared with known amino acid sequences of various
proteins in the protein sequence libraries.

The procedure which involves hypothesizing
candidate amino acid sequences based on fragment spectra is
useful in a number of contexts but also has certain
difficulties. Interpretation of the fragment spectra so as to
produce candidate amino acid sequences is time-consuming,
often inaccurate, highly technical and in general can be
performed only by a few laboratories with extensive experience
in tandem mass spectrometry. Reliance on human interpretation
often means that analysis is relatively slow and lacks strict
objectivity. Approaches based on peptide mass mapping are
limited to peptide masses derived from an intact homogeneous
protein generated by specific and known proteolytic cleavage
and thus are not generally applicable to mixtures of proteins.

Accordingly, it would be useful to provide a system
for correlating fragment spectra with known protein sequences
while avoiding the delay and/or subjectivity involved in
hypothesizing or deducing candidate amino acid sequences from
the fragment spectra.



Summary Of The Invention
According to the present invention, known amino
acid sequences, e.g., in a protein sequence library, are used
to calculate or predict one or more candidate fragment
spectra. The predicted fragment spectra are then compared
with an experimentally-derived fragment spectrum to determine
the best match or matches. Preferably, the parent peptide
from which the fragment spectrum was derived has a known mass.
Sub-sequences of the various sequences in the protein
sequence library are analyzed to identify those sub-sequences
corresponding to a peptide whose mass is equal to (or within a
given tolerance of) the mass of the parent peptide in the
fragment spectrum. For each sub-sequence having the proper
mass, a predicted fragment spectrum can be calculated, e.g.,
by calculating masses of various amino acid subsets of the
candidate peptide. The result will be a plurality of
candidate peptides, each with a predicted fragment spectrum.
The predicted fragment spectra can then be compared with the
fragment spectrum derived from the tandem mass spectrometer to
identify one or more proteins having sub-sequences which are
likely to be identical with the sequence of the peptide which
resulted in the experimentally-derived fragment spectrum.



Brief Description Of The Drawings 
Fig. 1 is a block diagram depicting previous
methods for correlating tandem mass spectrometer data with
sequences from a protein sequence library;

Fig. 2 is a block diagram showing a method for
correlating tandem mass spectrometer data with sequences from
a protein sequence library according to an embodiment of the
present invention;

Fig. 3 is a flow chart showing steps for
correlating tandem mass spectrometry data with amino acid
sequences, according to an embodiment of the present
invention;

Fig. 4 is a flow diagram showing details of a
method for the step of identifying candidate sub-sequences of
Fig. 3;

Fig. 5 is a fragment mass spectrum for a peptide of
a type that can be used in connection with the present
invention; and

Figs. 6A-6D are flow charts showing an analysis
method, according to an embodiment of the present invention.




Description Of The Specific Embodiments 
Before describing the embodiments of the present
invention, it will be useful to describe, in greater detail, a
previous method. As depicted in Fig. 1, the previous method
is used for analysis of an unknown peptide 12. Typically the
peptide will be output from a chromatography column which has
been used to separate a partially fractionated protein. The
protein can be fractionated by, for example, gel filtration
chromatography and/or high performance liquid chromatography
(HPLC). The sample 12 is introduced to a tandem mass
spectrometer 14 through an ionization method such as
electrospray ionization (ES). In the first mass spectrometer,
a peptide ion is selected, so that a targeted component of a
specific mass is separated from the rest of the sample 14a.
The targeted component is then activated or decomposed. In
the case of a peptide, the result will be a mixture of the
ionized parent peptide ("precursor ion") and component
peptides of lower mass which are ionized to various states. A
number of activation methods can be used including collisions
with neutral gases (also referred to as collision induced
dissociation). The parent peptide and its fragments are then
provided to the second mass spectrometer 14c, which outputs an
intensity and m/z for each of the plurality of fragments in
the fragment mixture. This information can be output as a
fragment mass spectrum 16. Fig. 5 provides an example of such
a spectrum 16. In the spectrum 16 each fragment ion is
represented as a bar whose abscissa value indicates the
mass-to-charge ratio (m/z) and whose ordinate value represents
intensity. According to previous methods, in order to
correlate a fragment spectrum with sequences from a protein
sequence library, the fragment spectrum was converted into one
or more amino acid sequences judged to correspond to the
fragment spectrum. In one strategy, the weight of each of the
amino acids is subtracted from the molecular weight of the
parent ion to determine what might be the molecular weight of
a fragment assuming, respectively, each amino acid is in the
terminal position. It is determined if this fragment mass is
found in the actual measured fragment spectrum. Scores are
generated for each of the amino acids and the scores are
sorted to generate a list of partial sequences for the next
subtraction cycle. Cycles continue until subtraction of the
mass of an amino acid leaves a difference of less than 0.5 and
greater than -0.5. The result is one or more candidate amino
acid sequences 18. This procedure can be automated as
described, for example, in Yates III (1991) supra. One or
more of the highest-scoring candidate sequences can then be
compared 21 to sequences in a protein sequence library 20 to
try to identify a protein having a sub-sequence similar or
identical to the sequence believed to correspond to the
peptide which generated the fragment spectrum 16.

Fig. 2 shows an overview of a process according to
the present invention. According to the process of Fig. 2, a
fragment spectrum 16 is obtained in a manner similar to that
described above for the fragment spectrum depicted in Fig. 1.
Specifically, the sample 12 is provided to a tandem mass
spectrometer 14. Procedures described below use a two-step
process to acquire MS/MS data. However, the present invention
can also be used in connection with mass spectrometry
approaches currently being developed which incorporate
acquisition of MS/MS data with a single step. In one
embodiment, MS/MS spectra would be acquired at each mass. The
first MS would separate the ions by mass-to-charge and the
second would record the MS/MS spectrum. The second stage of
MS/MS would acquire, e.g., 5 to 10 spectra at each mass
transformed by the first MS.

A number of mass spectrometers can be used
including a triple-quadrupole mass spectrometer, a
Fourier-transform cyclotron resonance mass spectrometer, a tandem
time-of-flight mass spectrometer and a quadrupole ion trap
mass spectrometer. In the process of Fig. 2, however, it is
not necessary to use the fragment spectrum as a basis for
hypothesizing one or more amino acid sequences. In the
process of Fig. 2, sub-sequences contained in the protein
sequence library 20 are used as a basis for predicting a
plurality of mass spectra 22, e.g., using prediction
techniques described more fully below.

A number of sequence libraries can be used, 
including, for example, the GenPept database, the GenBank
database (described in Burks, et al., "GenBank: Current status
and future directions," Methods in Enzymology 183:3
(1990)), the EMBL data library (described in Kahn, et al., "EMBL
Data Library," Methods in Enzymology 183:23 (1990)), the
Protein Sequence Database (described in Barker, et al.,
"Protein Sequence Database," Methods in Enzymology 183:31
(1990)), SWISS-PROT (described in Bairoch, et al., "The
SWISS-PROT protein sequence data bank, recent developments," Nucleic
Acids Res. 21:3093-3096 (1993)), and PIR-International
(described in "Index of the Protein Sequence Database of the
International Association of Protein Sequence Databanks
(PIR-International)," Protein Seq. Data Anal. 5:67-192 (1993)).

The predicted mass spectra 22 are compared 24 to 
the experimentally-derived fragment spectrum 16 to identify 
one or more of the predicted mass spectra which most closely 
match the experimentally-derived fragment spectrum 16. 
Preferably the comparison is done automatically by calculating 
a closeness-of-fit measure for each of the plurality of
predicted mass spectra 22 (compared to the
experimentally-derived fragment spectrum 16). It is believed that, in
general, there is high probability that the peptide analyzed
by the tandem mass spectrometer has an amino acid sequence
identical to one of the sub-sequences taken from the protein
sequence library 20 which resulted in a predicted mass
spectrum 22 exhibiting a high closeness-of-fit with respect to
the experimentally-derived fragment spectrum 16. Furthermore,
when the peptide analyzed by the tandem mass spectrometer 14
was derived from a protein, it is believed there is a high
probability that the parent protein is identical or similar to
the protein whose sequence in the protein sequence library 20
includes a sub-sequence that resulted in a predicted mass
spectrum 22 which had a high closeness-of-fit with respect to
the fragment spectrum 16. Preferably, the entire procedure
can be performed automatically using, e.g., a computer to
calculate predicted mass spectra 22 and/or to perform
comparison 24 of the predicted mass spectra 22 with the
experimentally-derived fragment spectrum 16.

Fig. 3 is a flow diagram showing one method for 
predicting mass spectra 22 and performing the comparison 24. 
According to the method of Fig. 3, the experimentally-derived 
fragment spectrum 16 is first normalized 32. According to one
normalization method, the experimentally-derived fragment
spectrum 16 is converted to a list of masses and intensities.
The values for the precursor ion are removed from the file.
The square root of all the intensity values is calculated and
normalized to a maximum intensity of 100. The 200 most
intense ions are divided into ten mass regions and the maximum
intensity is normalized to 100 within each region. Each ion
which is within 3.0 daltons of its neighbor on either side is
given the greater intensity value, if a neighboring intensity
is greater than its own intensity. Of course, other
normalizing methods can be used and it is possible to perform
analysis without performing normalization, although
normalization is, in general, preferred. For example, it is
possible to use maximum intensities with a value greater than
or less than 100. It is possible to select more or fewer than
the 200 most intense ions. It is possible to divide into more
or fewer than 10 mass regions. It is possible to make the
window for assuming the neighboring intensity value to be
greater than or less than 3.0 daltons.
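
The normalization steps above can be sketched in Python as follows. This is an illustrative sketch only, not the procedure's actual code: the function name, the width of the window used to drop the precursor ion, and the choice of equal-width mass regions are assumptions, and the final neighbor-intensity step (the 3.0 dalton window) is left out for brevity.

```python
import math

def normalize_spectrum(peaks, precursor_mz, precursor_window=1.0,
                       top_n=200, n_regions=10, max_intensity=100.0):
    """peaks: list of (mz, intensity) tuples. Returns a list sorted by m/z."""
    # Drop the precursor ion and any non-positive intensities.
    # The +/- precursor_window width is an assumption.
    kept = [(mz, i) for mz, i in peaks
            if i > 0 and abs(mz - precursor_mz) > precursor_window]
    if not kept:
        return []
    # Square-root transform, then scale so the largest value is max_intensity.
    rooted = [(mz, math.sqrt(i)) for mz, i in kept]
    top = max(i for _, i in rooted)
    scaled = [(mz, i * max_intensity / top) for mz, i in rooted]
    # Keep the top_n most intense ions and put them back in m/z order.
    strongest = sorted(scaled, key=lambda p: p[1], reverse=True)[:top_n]
    strongest.sort()
    # Split the covered mass range into equal-width regions (an assumption)
    # and rescale each region so its largest peak equals max_intensity.
    lo, hi = strongest[0][0], strongest[-1][0]
    width = (hi - lo) / n_regions or 1.0
    regions = {}
    for mz, i in strongest:
        idx = min(int((mz - lo) / width), n_regions - 1)
        regions.setdefault(idx, []).append((mz, i))
    out = []
    for idx in sorted(regions):
        region_max = max(i for _, i in regions[idx])
        out.extend((mz, i * max_intensity / region_max)
                   for mz, i in regions[idx])
    return out
```

The defaults mirror the values given in the text (200 ions, ten regions, maximum of 100) and can be changed in the ways the text allows.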

In order to generate predicted mass spectra from a 
protein sequence library, according to the process of Fig. 3, 
sub-sequences within each protein sequence are identified
which have a mass which is within a tolerance amount of the
mass of the unknown peptide. As noted above, the mass of the
unknown peptide is known from the tandem mass spectrometer 14.
Identification of candidate sub-sequences 34 is shown in
greater detail in Fig. 4. In general, the process of
identifying candidate sub-sequences involves summing the
masses of linear amino acid sequences until the sum is either
within a tolerance of the mass of the unknown peptide (the
"target" mass) or has exceeded the target mass (plus
tolerance). If the mass of the sequence is within tolerance
of the target mass, the sequence is marked as a candidate. If
the mass of the linear sequence exceeds the mass of the
unknown peptide, then the algorithm is repeated, beginning
with the next amino acid position in the sequence.

According to the method of Fig. 4, a variable m,
indicating the starting amino acid along the sequence, is
initialized to 0 and incremented by 1 (36, 38). The sum,
representing the cumulative mass, and a variable n, representing
the number of amino acids thus far calculated in the sum, are
initially set to 0 (40) and variable n is incremented 42. The
molecular weight of a peptide corresponding to a sub-sequence
of a protein sequence is calculated in iterative fashion by
steps 44 and 46. In each iteration, the sum is incremented by
the molecular weight of the next (nth) amino
acid in the sequence 44. Values of the sum 44 may be stored
for use in calculating fragment masses for use in predicting a
fragment mass spectrum as described below. If the resulting
sum is less than the target mass decremented by a tolerance
46, the value of n is incremented 42 and the process is
repeated 44. A number of tolerance values can be used. In
one embodiment, a tolerance value of ±0.05% of the mass of the
unknown peptide was used. If the new sum is no longer less
than a tolerance amount below the target mass, it is then
determined if the new sum is greater than the target mass plus
the tolerance amount. If the new sum is more than the
tolerance amount in excess of the target mass, this particular
sequence is not considered a candidate sequence and the
process begins again, starting from a new starting point in
the sequence (by incrementing the starting point value m
(38)). If, however, the sum is not greater than the target
mass plus the tolerance amount, it is known that the sum is
within one tolerance amount of the target mass and, thus, that
the sub-sequence beginning with the mth amino acid and extending to the
(m + n)th amino acid of the sequence is a candidate sequence.
The candidate sequence is marked, e.g., by storing the values
of m and n to define this sub-sequence.

Returning to Fig. 3, once a plurality of candidate 
sub-sequences have been identified, a fragment mass spectrum
is predicted for each of the candidate sequences 52. The
fragment mass spectrum is predicted by calculating the
fragment ion masses for the type b- and y-ions for the amino
acid sequence. When a peptide is fragmented and the charge is
retained on the N-terminal cleavage fragment, the resulting
ion is labelled as a b-type ion. If the charge is retained on
the C-terminal fragment, it is labelled a y-type ion.
Masses for type b-ions were calculated by summing the amino
acid masses and adding the mass of a proton. Type y-ions
were calculated by summing, from the C-terminus, the masses of
the amino acids and adding the mass of water and a proton to
the initial amino acid. In this way, it is possible to
calculate an m/z for each fragment. However, in order to
provide a predicted mass spectrum, it is also necessary to
assign an intensity value for each fragment. It might be
possible to predict, on a theoretical basis, an intensity value
for each fragment. However, this procedure is difficult. It
has been found useful to assign intensities in the following
fashion. The value of 50.0 is assigned to each b and y ion.
To masses of 1 dalton on either side of the fragment ion, an
intensity of 25.0 is assigned. Peak intensities of 10.0 are
assigned at 17.0 and 18.0 daltons below the m/z of each b- and
y-ion location (for both NH3 and H2O loss), and peak
intensities of 10.0 are assigned at 28.0 amu below each type b
ion location (for type a-ions).
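
A predicted spectrum of the kind just described can be sketched as follows. The sketch assumes monoisotopic residue masses, singly charged fragments, and rounding of each m/z to a unit mass; where two assignments fall on the same unit mass it keeps the larger intensity. Those choices, and the function names, are assumptions made for illustration.

```python
PROTON = 1.00728
WATER = 18.01056
# Monoisotopic residue masses in daltons (assumed mass table).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def fragment_mz(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    b, y = [], []
    total = PROTON  # b ions: residues summed from the N-terminus + proton
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        b.append(total)
    total = WATER + PROTON  # y ions: summed from the C-terminus + water + proton
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        y.append(total)
    return b, y

def predicted_spectrum(peptide):
    """Return {unit_mass: intensity} using the intensities given in the text."""
    b, y = fragment_mz(peptide)
    spectrum = {}

    def put(mz, intensity):
        key = round(mz)
        # On a collision keep the larger intensity (assumption).
        spectrum[key] = max(spectrum.get(key, 0.0), intensity)

    for mz in b + y:
        put(mz, 50.0)        # the b or y ion itself
        put(mz - 1, 25.0)    # 1 dalton on either side
        put(mz + 1, 25.0)
        put(mz - 17, 10.0)   # NH3 loss
        put(mz - 18, 10.0)   # H2O loss
    for mz in b:
        put(mz - 28, 10.0)   # type a ion (CO loss), b ions only
    return spectrum
```

For the tripeptide GAK this gives b ions near m/z 58.03 and 129.07 and y ions near m/z 147.11 and 218.15.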

Returning to Fig. 3, after calculation of predicted
m/z values and assignment of intensities, it is preferred to
calculate a measure of closeness-of-fit between the predicted
mass spectra 22 and the experimentally-derived fragment
spectrum 16. A number of methods for calculating
closeness-of-fit are available. In the embodiment depicted in Fig. 3, a
two-step method is used 54. The two-step method includes
calculating a preliminary closeness-of-fit score, referred to
here as Sp 56 and, for the highest-scoring amino acid
sequences, calculating a correlation function 58. According
to one embodiment, Sp is calculated using the following
formula:

\[ S_p = \left( \sum i_m \right) \, n_i \, (1 + \beta) \, (1 - \rho) \, / \, n_\tau \qquad (1) \]

where i_m = matched intensities, n_i = number of matched
fragment ions, β = type b- and y-ion continuity, ρ = presence
of immonium ions and their respective amino acids in the
predicted sequence, n_τ = total number of fragment ions. The
factor, β, evaluates the continuity of a fragment ion series.
If there was a fragment ion match for the ion immediately
preceding the current type b- or y-ion, β is incremented by
0.075 (from an initial value of 0.0). This increases the
preliminary score for those peptides matching a successive
series of type b- and y-ions since extended series of ions of
the same type are often observed in MS/MS spectra. The factor
ρ evaluates the presence of immonium ions in the low mass end
of the mass spectrum. Immonium ions are diagnostic for the
presence of some types of amino acids in the sequence. If
immonium ions are present at 110.0, 120.0, or 136.0 Da (± 1.0
Da) in the processed data file of the unknown peptide with
normalized intensities greater than 40.0, indicating the
presence of histidine, phenylalanine, and tyrosine
respectively, then the sequence under evaluation is checked
for the presence of the amino acid indicated by the immonium
ion. The preliminary score, Sp, for the peptide is either
augmented or depreciated by a factor of (1 - ρ) where ρ is the
sum of the penalties for each of the three amino acids whose
presence is indicated in the low mass region. Each individual
ρ can take on the value of -0.15 if there is a corresponding
low mass peak and the amino acid is not present in the
sequence, +0.15 if there is a corresponding low mass peak and
the amino acid is present in the sequence, or 0.0 if the low
mass peak is not present. The total penalty can range from
-0.45 (all three low mass peaks present in the spectrum yet
none of the three amino acids are in the sequence) to +0.45
(all three low mass peaks are present in the spectrum and all
three amino acids are in the sequence).
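
One reading of the preliminary score can be sketched in Python as below. The sketch takes Sp to be the sum of matched intensities times the number of matched ions, scaled by (1 + β) and (1 - ρ) and divided by the total number of predicted fragment ions. The function names are hypothetical, β is incremented only when both an ion and its immediate predecessor in the same series are matched (an interpretation of the text), and ρ is passed in rather than computed, so the sketch adopts the (1 - ρ) factor exactly as the text states it.

```python
def continuity_bonus(matched_flags, step=0.075):
    """matched_flags: booleans for one ion series (b1, b2, ... or y1, y2, ...),
    in series order. Adds `step` for each matched ion whose immediate
    predecessor was also matched."""
    beta = 0.0
    for prev, cur in zip(matched_flags, matched_flags[1:]):
        if prev and cur:
            beta += step
    return beta

def preliminary_score(matched_intensities, n_total, beta, rho):
    """Sp = (sum of matched intensities) * (number matched)
            * (1 + beta) * (1 - rho) / (total predicted fragment ions)."""
    if n_total == 0:
        return 0.0
    n_matched = len(matched_intensities)
    return (sum(matched_intensities) * n_matched
            * (1 + beta) * (1 - rho) / n_total)
```

A typical use would sum continuity_bonus over the b series and the y series to obtain β, then call preliminary_score once per candidate sequence.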

Following the calculation of the preliminary 
closeness-of-fit score Sp, those candidate predicted mass
spectra having the highest Sp scores are selected for further
analysis using the correlation function 58. The number of
candidate predicted mass spectra which are selected for
further analysis will depend largely on the computational
resources and time available. In one embodiment, 300 
candidate peptide sequences with the highest preliminary score 
were selected. 

For purposes of calculating the correlation 
function, 58, the experimentally-derived fragment spectrum is 
preprocessed in a fashion somewhat different from 
preprocessing 32 employed before calculating Sp. For purposes 
of the correlation function, the precursor ion was removed 
from the spectrum and the spectrum divided into 10 sections. 
Ions in each section were then normalized to 50.0. The 
sectionwise normalized spectra 60 were then used for 
calculating the correlation function. According to one 
embodiment, the discrete correlation between the two functions 
is calculated as:

\[ R_\tau = \sum_{i=0}^{n-1} x_i \, y_{i+\tau} \qquad (2) \]

where τ is a lag value. The discrete correlation theorem
states that the discrete correlation of two real functions x
and y is one member of the discrete Fourier transform pair

\[ R_\tau \Longleftrightarrow X_t \, Y_t^{*} \qquad (3) \]

where X(t) and Y(t) are the discrete Fourier transforms of
x(i) and y(i) and the * denotes complex conjugation.
Therefore, the cross-correlations can be computed by Fourier 
transformation of the two data sets using the fast Fourier 
transform (FFT) algorithm, multiplication of one transform by
the complex conjugate of the other, and inverse transformation
of the resulting product. In one embodiment, all of the
predicted spectra as well as the pre-processed unknown
spectrum were zero-padded to 4096 data points since the MS/MS
spectra are not periodic (as intended by the correlation
theorem) and the FFT algorithm requires N to be an integer
power of two, so the resulting end effects need to be
considered. The final score attributed to each candidate
peptide sequence is R(0) minus the mean of the
cross-correlation function over the range -75 < τ < 75. This
modified "correlation parameter," described in Powell and
Hieftje, Anal. Chim. Acta, Vol. 100, pp. 313-327 (1978), shows
better discrimination over just the spectral correlation
coefficient R(0). The raw scores are normalized to 1.0. In
one embodiment, output 62 includes the normalized raw score,
the candidate peptide mass, the unnormalized correlation
coefficient, the preliminary score, the fragment ion
continuity β, the immonium ion factor ρ, the number of type b-
and y-ions matched out of the total number of fragment ions,
their matched intensities, the protein accession number, and
the candidate peptide sequence.
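
The final score (R(0) minus the mean of the cross-correlation over -75 < τ < 75) can be illustrated with the sketch below. To stay self-contained it evaluates the lagged sum directly rather than through an FFT; with sufficient zero padding the FFT route described above should give the same values, only faster. The function names are hypothetical, and spectra are assumed to be lists of intensities indexed by unit mass, with samples outside a list treated as zero.

```python
def cross_correlation(x, y, lag):
    """R(lag) = sum over i of x[i] * y[i + lag], with out-of-range
    samples of y treated as zero (equivalent to zero padding)."""
    total = 0.0
    for i, xi in enumerate(x):
        j = i + lag
        if 0 <= j < len(y):
            total += xi * y[j]
    return total

def correlation_score(x, y, max_lag=75):
    """R(0) minus the mean of R(lag) over -max_lag < lag < max_lag."""
    lags = range(-max_lag + 1, max_lag)
    mean = sum(cross_correlation(x, y, t) for t in lags) / len(lags)
    return cross_correlation(x, y, 0) - mean
```

Subtracting the mean over nearby lags rewards a sharp peak at zero lag: a predicted spectrum that only matches the experimental one when shifted by a few mass units scores lower than one that matches in place.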

If desired, the correlation function 58 can be used 
to automatically select one of the predicted mass spectra 22
as corresponding to the experimentally-derived fragment
spectrum 16. Preferably, however, a number of sequences from
the library 20 are output and final selection of a single
sequence is done by a skilled operator.

In addition to predicting mass spectra from protein
sequence libraries, the present invention also includes
predicting mass spectra based on nucleotide databases. The
procedure involves the same algorithmic approach of cycling
through the nucleotide sequence. The 3-base codons will be
converted to a protein sequence and the mass of the amino
acids summed in a fashion similar to the summing depicted in
Fig. 4. To cycle through the nucleotide sequence, a 1-base
increment will be used for each cycle. This will allow the
determination of an amino acid sequence for each of the three
reading frames in one pass. The scoring and reporting
procedures for the search can be the same as that described
above for the protein sequence database.
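
The one-pass translation of all three reading frames can be sketched as follows. The sketch assumes the standard genetic code and a DNA sequence written with the letters T, C, A and G; the function name is hypothetical. Stop codons are written as "*" and codons containing any other letter as "X", both of which are conventions chosen for this example.

```python
BASES = "TCAG"
# Standard genetic code, ordered with T, C, A, G at each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate_frames(dna):
    """Step through `dna` one base at a time and return the three
    forward reading frames as amino acid strings."""
    frames = ["", "", ""]
    for pos in range(len(dna) - 2):
        codon = dna[pos:pos + 3]
        # The frame is determined by the start position modulo 3.
        frames[pos % 3] += CODON_TABLE.get(codon, "X")
    return frames
```

Each of the three strings can then be passed to the same mass-summing search used for protein sequences. The complementary strand is not handled here, since the text mentions only three reading frames.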

Depending on the computing and time resources 
available, it may be advantageous to employ data-reduction
techniques. Preferably these techniques will emphasize the
most informative ions in the spectrum while not unduly
affecting search speed. One technique involves considering
only some of the fragment ions in the MS/MS spectrum. A
spectrum for a peptide may contain as many as 3,000 fragment
ions. According to one data reduction strategy, the ions are
ranked by intensity and some fraction of the most intense ions
(e.g., the top 200 most intense ions) will be used for
comparison. Another approach involves subdividing the
spectrum into, e.g., 4 or 5 regions and using the 50 most
intense ions in each region as part of the data set. Yet
another approach involves selecting ions based on the
probability of those ions being sequence ions. For example,
ions could be selected which exist in mass windows of 57
through 186 daltons (range of mass increments for the 20
common amino acids from GLY to TRP) that contain diagnostic
features of type b- or y-ions, such as losses of 17 or 18
daltons (NH3 or H2O) or a loss of 28 daltons (CO).
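
The first two data-reduction strategies can be sketched in Python as below. The function names are hypothetical, and the regions are assumed to be of equal width across the observed mass range, which the text does not specify.

```python
def top_n(peaks, n=200):
    """Keep the n most intense (mz, intensity) peaks, returned in m/z order."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(strongest)

def top_per_region(peaks, n_regions=4, per_region=50):
    """Split the mass range into equal-width regions (an assumption) and
    keep the per_region most intense peaks from each region."""
    if not peaks:
        return []
    lo = min(mz for mz, _ in peaks)
    hi = max(mz for mz, _ in peaks)
    width = (hi - lo) / n_regions or 1.0
    buckets = [[] for _ in range(n_regions)]
    for mz, intensity in peaks:
        idx = min(int((mz - lo) / width), n_regions - 1)
        buckets[idx].append((mz, intensity))
    kept = []
    for bucket in buckets:
        kept.extend(sorted(bucket, key=lambda p: p[1], reverse=True)[:per_region])
    return sorted(kept)
```

Selecting per region keeps weaker but informative ions at the ends of the mass range that a single global cutoff might discard.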
The techniques described above are, in general,
applicable to spectra of peptides with charged states of +1 or
+2, typically having a relatively short amino acid sequence.
Using a longer amino acid sequence increases the probability
of a unique match to a protein sequence. However, longer
peptide sequences have a greater likelihood of containing more
basic amino acids, and thus producing ions of higher charge
state under electrospray ionization conditions. According to
one embodiment of the invention, algorithms are provided for
searching a database with MS/MS spectra of highly charged
peptides (+3, +4, +5, etc.). According to one approach, the
search program will include an input for the charge state (N)
of the precursor ion used in the MS/MS analysis. Predicted
fragment ions will be generated for each charge state less
than N. Thus, for a peptide of +4, the charge states of +1, +2
and +3 will be generated for each fragment ion and compared to
the MS/MS spectrum.
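
Generating the fragment m/z values for each charge state below N can be sketched as follows. The sketch assumes that each additional charge is carried by an added proton (mass 1.00728 daltons), which is the usual convention for electrospray spectra but is not stated in the text; the function names are hypothetical.

```python
PROTON = 1.00728

def mz_at_charge(singly_charged_mz, z):
    """Convert the m/z of a singly protonated fragment to its m/z at charge z."""
    neutral = singly_charged_mz - PROTON
    return (neutral + z * PROTON) / z

def fragment_mz_for_precursor(singly_charged_mz, precursor_charge):
    """Return {z: m/z} for charge states +1 up to +(N - 1), where N is the
    precursor charge state. Intended for N >= 2."""
    return {z: mz_at_charge(singly_charged_mz, z)
            for z in range(1, precursor_charge)}
```

For a +4 precursor, a fragment whose singly charged m/z is 1001.00728 would also be predicted near m/z 501.007 (+2) and 334.341 (+3).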

The second strategy for use with multiply charged 
spectra is the use of mathematical deconvolution to convert
the multiply charged fragment ions to their singly charged
masses. The deconvoluted spectrum will then contain the
fragment ions for the multiply charged fragment ions and their
singly charged counterparts.
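
The arithmetic at the core of such a conversion can be sketched as below, assuming the charge state of each fragment ion is already known and that each charge is carried by a proton. Determining the charge state from the spectrum itself, which a full deconvolution would have to do, is not shown, and the function name is hypothetical.

```python
PROTON = 1.00728

def to_singly_charged(mz, z):
    """Convert the m/z of an ion observed at charge z to the m/z the same
    ion would have at charge +1 (charge assumed to be known)."""
    return mz * z - (z - 1) * PROTON
```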

To speed up searches of the database, a directed-search
approach can be used. In cases where experiments are
performed on specific organisms or specific types of proteins,
it is not necessary to search the entire database on the first
pass. Instead, a search of protein sequences specific to a
species or a class of proteins can be performed first. If the
search does not provide reasonable answers, then the entire
database is searched.

WO 95/25281

PCT/US95/03239

A number of different scoring algorithms can be
used for determining preliminary closeness of fit or
correlation. In addition to scoring based on the number of
matched ions multiplied by the sum of the intensity, scoring
can be based on the percentage of continuous sequence coverage
represented by the sequence ions in the spectrum. For
example, a 10 residue peptide will potentially contain 9 each
of b- and y-type sequence ions. If a set of ions extends from
B1 to B9, then a score of 100 is awarded, but if a
discontinuity is observed in the middle of the sequence, such
as missing the B5 ion, a penalty is assessed. The maximum
score is awarded for an amino acid sequence that contains a
continuous ion series in both the b and y directions.
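
One possible form of a continuity-based score is sketched below. The text does not specify how the penalty for a discontinuity is computed, so this sketch simply reports the longest unbroken run of matched ions as a percentage of the series length; that rule and the function name are assumptions made for illustration.

```python
def continuous_coverage(matched_flags):
    """matched_flags: booleans for one ion series in order (e.g. B1..B9).
    Returns 100 * (longest unbroken run of matches) / (series length)."""
    if not matched_flags:
        return 0.0
    best = run = 0
    for hit in matched_flags:
        run = run + 1 if hit else 0
        best = max(best, run)
    return 100.0 * best / len(matched_flags)
```

Under this rule a complete B1 to B9 series scores 100, while the same series with B5 missing scores about 44, since the longest unbroken run is then four ions out of nine.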

In the event the described scoring procedures do 
not delineate an answer, an additional technique for spectral
comparison can be used in which the database is initially
searched with a molecular weight value and a reduced set of
fragment ions. Initial filtering of the database occurs by
matching sequence ions and generating a score with one of the
methods described above. The resulting set of answers will
then undergo a more rigorous inspection process using a
modified full MS/MS spectrum. For the second stage analysis,
one of several spectral matching approaches developed for
spectral library searching is used. This will require
generating a "library spectrum" for the peptide sequence based
on the sequence ions predicted for that amino acid sequence.
Intensity values for sequence ions of the "library spectrum"
will be obtained from the experimental spectrum. If a
fragment ion is predicted at m/z 256, then the intensity value
for the ion in the experimental spectrum at m/z 256 will be
used as the intensity of the ion in the predicted spectrum.
Thus, if the predicted spectrum is identical to the "unknown"
spectrum, it will represent an ideal spectrum. The spectra 
will then be compared using a correlation function. In 
general, it is believed that the majority of computational 

30 time for the above procedure is spent in the iterative search 

process. By multiplexing the analysis of multiple MS/MS 
spectra in one pass through the database, an overall 
improvement in efficiency will be realized. In addition, the 
mass tolerance used in the initial pre** filtering can affect 

35 search times by increasing or decreasing the number of 

sequences to analyze in subsequent steps. Another approach to 
speed up searches involves a binary encryption scheme where 
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-the mass spectrum is encoded as peak/no peak at every mass 
depending on whether the peak is above a certain threshold 
value « If intensive use of a protein sequence library is 
contemplated, it may be possible to calculate and store 
5 predicted mass values of all sub-sequences within a 

predetermined range of masses so that at least some of the 
analysis can be performed by table look-up rather than 
calculation. 
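
The peak/no-peak encoding mentioned above can be sketched in Python as a bit mask with one bit per unit mass. This is only an illustrative sketch: the function names, the 2000 amu upper limit and the use of a bitwise AND to count shared peaks are assumptions, not details given in the text.

```python
def encode_peaks(peaks, threshold, max_mass=2000):
    """Encode a spectrum as an integer bit mask with one bit per unit mass.

    peaks: iterable of (mass, intensity) pairs.
    A bit is set when a peak at that unit mass exceeds the threshold.
    """
    mask = 0
    for mass, intensity in peaks:
        unit_mass = int(mass + 0.5)  # round to the nearest unit mass
        if 0 <= unit_mass <= max_mass and intensity > threshold:
            mask |= 1 << unit_mass
    return mask


def count_shared_peaks(mask_a, mask_b):
    """Number of unit masses at which both spectra have a peak."""
    return bin(mask_a & mask_b).count("1")


experimental = encode_peaks([(100.2, 50.0), (200.7, 5.0), (300.0, 80.0)], threshold=10.0)
predicted = encode_peaks([(100.0, 20.0), (250.0, 30.0), (300.4, 11.0)], threshold=10.0)
shared = count_shared_peaks(experimental, predicted)
```

Once both spectra are encoded, comparing them reduces to a single bitwise operation, which is the kind of saving the encoding is meant to provide.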

Figs. 6A-6E are flow charts showing an analysis procedure
according to one embodiment of the present invention. After data
is acquired from the tandem mass spectrometer, as described above
602, the data is saved to a file and converted to an ASCII format
604. At this point, a preprocessing procedure is started 606.
The user enters information regarding the peptide mass and the
precursor ion charge state 608. Mass/intensity values are loaded
from the ASCII file, with the values being rounded to unit masses
610. The previously-identified precursor ion contribution of
this data is removed 612. The remaining data is normalized to a
maximum intensity of 100 614. At this point, different paths can
be taken. In one case, the presence of any immonium ions (H, F
and Y) is noted 616 and the peptide mass and immonium ion
information is stored in a datafile 618. In another route, the
200 most intense peaks are selected 620. If two peaks are within
a predetermined distance (e.g., 2 amu) of each other, the lower
intensity peak is set equal to the greater intensity 622. After
this procedure, the data is stored in a datafile for preliminary
scoring 624. In another route, the data is divided into a number
of windows, for example ten windows 626. Normalization is
performed within each window, for example, normalizing to a
maximum intensity of 50 628. This data is then stored in a
datafile for final correlation scoring 630. This ends the
preprocessing phase, according to this embodiment 632.
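
The preprocessing routes of Fig. 6A can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the width of the precursor region that is removed (`half_width`) is not given in the text, and keeping the largest intensity when several peaks round to the same unit mass is an assumption.

```python
def to_unit_masses(peaks):
    """Round m/z values to unit masses (step 610); keep the largest intensity per mass."""
    spectrum = {}
    for mz, intensity in peaks:
        unit_mass = int(mz + 0.5)
        spectrum[unit_mass] = max(spectrum.get(unit_mass, 0.0), intensity)
    return spectrum


def remove_precursor(spectrum, precursor_mz, half_width):
    """Drop peaks near the precursor ion (step 612); half_width is an assumed parameter."""
    return {m: i for m, i in spectrum.items() if abs(m - precursor_mz) > half_width}


def normalize(spectrum, new_max):
    """Scale intensities so that the largest equals new_max (step 614 or 628)."""
    current_max = max(spectrum.values(), default=0.0)
    if current_max <= 0.0:
        return dict(spectrum)
    return {m: i * new_max / current_max for m, i in spectrum.items()}


def top_peaks_with_neighbor_fill(spectrum, n_peaks=200, distance=2):
    """Keep the n most intense peaks (620); raise each to its strongest neighbor (622)."""
    strongest = sorted(spectrum.items(), key=lambda item: item[1], reverse=True)
    kept = dict(strongest[:n_peaks])
    return {
        m: max(i2 for m2, i2 in kept.items() if abs(m2 - m) <= distance)
        for m in kept
    }


def normalize_in_windows(spectrum, n_windows=10, new_max=50.0):
    """Split the mass range into equal windows (626) and normalize each one (628)."""
    if not spectrum:
        return {}
    low, high = min(spectrum), max(spectrum)
    width = (high - low + 1) / n_windows
    windows = {}
    for m, i in spectrum.items():
        windows.setdefault(int((m - low) / width), {})[m] = i
    result = {}
    for window in windows.values():
        result.update(normalize(window, new_max))
    return result


raw = [(100.4, 10.0), (101.2, 40.0), (300.3, 20.0), (500.0, 200.0), (900.1, 5.0)]
base = normalize(remove_precursor(to_unit_masses(raw), precursor_mz=500, half_width=2), 100.0)
for_preliminary_scoring = top_peaks_with_neighbor_fill(base)
for_correlation = normalize_in_windows(base)
```

The two derived spectra correspond to the datafiles written at steps 624 and 630: one favors the strongest peaks for preliminary scoring, the other evens out intensities across the mass range for correlation.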

The database search is started 634 and the search parameters and
the data obtained from the preprocessing procedure (Fig. 6A) are
loaded 636. A first batch of database sequences is loaded 638
and a search procedure is run on a particular protein 640. The
search procedure is detailed in Fig. 6C. As long as the end of
the batch has not been reached the index is incremented 642 and
the search routine is repeated 640. Once it is determined that
the end of a batch has been reached 644, as long as the end of
the database has not been reached, the second index 646 is
incremented and a new batch of database sequences is loaded 638.
Once the end of the database has been reached 628, a correlation
analysis is performed 630 (as detailed in Fig. 6E), the results
are printed 632 and the procedure ends 634.

When the search procedure is started 638 (Fig. 6C), an index I1
is set to zero 646 to indicate the start position of the
candidate peptide within the amino acid sequence being searched
640. A second index I2, indicating the end position of the
candidate peptide within the amino acid sequence being searched,
is initially set equal to I1 and the variable Pmass, indicating
the accumulated mass of the candidate peptide, is initialized to
zero 648. During each iteration through a given candidate
peptide 650 the mass of the amino acid at position I2 is added to
Pmass 652. It is next determined whether the mass thus far
accumulated (Pmass) equals the input mass (i.e., the mass of the
peptide) 654. In some embodiments, this test may be performed as
plus or minus a tolerance rather than requiring strict equality,
as noted above. If there is equality (optionally within a
tolerance) an analysis routine is started 656 (detailed in Fig.
6D). Otherwise, it is determined whether Pmass is less than the
input mass (optionally within a tolerance). If so, the index I2
is incremented 658 and the mass of the amino acid at the next
position (the incremented I2 position) is added to Pmass 652. If
Pmass is greater than the input mass (optionally by more than a
tolerance) 660, it is determined whether index I1 is at the end
of a protein 662. If so, the search routine exits 664.
Otherwise, index I1 is incremented 666 so that the routine can
start with a new start position of a candidate peptide and the
search procedure returns to block 648.
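
The two-index search of Fig. 6C can be sketched in Python as a pair of nested loops. The sketch is illustrative only: the function name is hypothetical, the residue masses are standard average values, and starting the accumulated mass at the mass of water (so that neutral peptide masses are compared) is an assumption, since the text initializes Pmass to zero without stating how the terminal groups are handled.

```python
# Standard average residue masses (Da).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def find_candidates(protein, peptide_mass, tolerance=1.0):
    """Return (start, end, sequence) for every sub-sequence matching peptide_mass."""
    candidates = []
    for i1 in range(len(protein)):            # start position (index I1)
        pmass = WATER                          # assumption: neutral peptide mass
        for i2 in range(i1, len(protein)):    # end position (index I2)
            pmass += AVERAGE_RESIDUE_MASS[protein[i2]]
            if abs(pmass - peptide_mass) <= tolerance:
                candidates.append((i1, i2, protein[i1:i2 + 1]))
            elif pmass > peptide_mass + tolerance:
                break                          # too heavy: move to the next start
    return candidates


hits = find_candidates("MKDLRSWTAGK", peptide_mass=847.93)
```

Because the accumulated mass only grows as I2 advances, the inner loop can stop as soon as the mass exceeds the upper tolerance limit, which keeps the scan over each protein short.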




When the analysis procedure is started 670 (Fig. 6D), data
indicative of b- and y-ions for the candidate peptide are
generated 672, as described above. It is determined whether the
peak is within the top 200 ions 674. The peak intensity is
summed and the fragment match index is incremented 676. If
previous b- or y-ions are matched 678, the β index is incremented
680. Otherwise, it is determined whether all fragment ions have
been analyzed. If not, the fragment index is incremented 684 and
the procedure returns to block 674. Otherwise, a preliminary
score such as Sp, described above, is calculated 686. If the
newly calculated Sp is greater than the lowest score 688, the
peptide sequence is stored 690 unless the sequence has already
been stored, in which case the procedure exits 692.
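
A Python sketch of the fragment-ion generation and matching step follows. It is a simplified illustration, not the Sp calculation itself: the score here is only the number of matched ions multiplied by their summed intensity, any continuity term is omitted, and the function names, the singly charged ion assumption and the 1 amu matching tolerance are all assumptions.

```python
# Standard average residue masses (Da).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153
PROTON = 1.0073


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values (average masses) for a peptide."""
    b_ions, y_ions = [], []
    running = 0.0
    for residue in peptide[:-1]:               # b-ions grow from the N-terminus
        running += AVERAGE_RESIDUE_MASS[residue]
        b_ions.append(running + PROTON)
    running = 0.0
    for residue in reversed(peptide[1:]):      # y-ions grow from the C-terminus
        running += AVERAGE_RESIDUE_MASS[residue]
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def simplified_score(peptide, spectrum, tolerance=1.0):
    """Matched-ion count times summed matched intensity (a simplification)."""
    b_ions, y_ions = fragment_ions(peptide)
    matched, summed_intensity = 0, 0.0
    for predicted_mz in b_ions + y_ions:
        nearby = [i for mz, i in spectrum.items() if abs(mz - predicted_mz) <= tolerance]
        if nearby:
            matched += 1
            summed_intensity += max(nearby)
    return matched * summed_intensity


b_series, y_series = fragment_ions("GAS")
score = simplified_score("GAS", {58: 10.0, 106: 20.0, 400: 99.0})
```

For a peptide of n residues the sketch produces n - 1 ions of each type, consistent with the 9 b-type and 9 y-type ions expected for a 10 residue peptide.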

At the beginning of the correlation analysis (Fig. 6E), a stored
candidate peptide is selected 693. A theoretical spectrum for
the candidate peptide is created 694, correlated with
experimental data 695 and a final correlation score is obtained
696, as described above. The index is incremented 697 and the
process repeated from block 693 unless all candidate peptides
have been scored 698, in which case the correlation analysis
procedure exits 699.

The following examples are offered by way of 
illustration, not limitation. 


Experimental 
Example #1 

MHC complexes were isolated from HS-EBV cells transformed with
HLA-DRB*0401 using antibody affinity chromatography. Bound
peptides were released and isolated by filtration through a
Centricon 10 spin column. The heavy chain of glycoasparaginase
from human leukocytes was isolated. Proteolytic digestions were
performed by dissolving the protein in 50 mM ammonium bicarbonate
containing 10 mM Ca2+, pH 8.6. Trypsin was added in a ratio of
100/1 protein/enzyme.

Analysis of the resulting peptide mixtures was performed by LC-MS
and LC-MS/MS. Briefly, molecular weights of peptides were
recorded by scanning Q3 or Q1 at a rate of 400 Da/sec over a mass
range of 300 to 1600 throughout the HPLC gradient. Sequence
analysis of peptides was performed during a second HPLC analysis
by selecting the precursor ion with a 6 amu (FWHH) wide window in
Q1 and passing the ions into a collision cell filled with argon
to a pressure of 3-5 mtorr. Collision energies were on the order
of 20 to 50 eV. The fragment ions produced in Q2 were
transmitted to Q3 and a mass range of 50 Da to the molecular
weight of the precursor ion was scanned at 500 Da/sec to record
the fragment ions. The low energy spectra of 36 peptides were
recorded and stored on disk. The genpept database contains
protein sequences translated from nucleotide sequences. A text
search of the database was performed to determine if the
sequences for the peptide amino acid sequences used in the
analysis were present in the database. Subsequently, a second
database was created from the whole database by appending amino
acid sequences for peptides not included.

The spectrum data was converted to a list of masses 
and intensities and the values for the precursor ion were 
removed from the file. The square root of all the intensity 
values was calculated and normalized to a maximum intensity of 
100.0. All ions except the 200 most intense ions were removed 
from the file. The remaining ions were divided into 10 mass 
regions and the maximum intensity normalized to 100.0 within 
each region. Each ion within 3.0 daltons of its neighbor on 
either side was given the greater intensity value, if the 
neighboring intensity was greater than its own intensity. 
This processed data was stored for comparison to the candidate 
sequences chosen from the database search. The MS/MS spectrum 
was modified in a different manner for calculation of a 
correlation function. The precursor ion was removed from the 
spectrum and the spectrum divided into 10 equal sections. 
Ions in each section were then normalized to 50.0. This 
spectrum was used to calculate the correlation coefficient 
against a predicted MS/MS spectrum for each amino acid 
sequence retrieved from the database. 

Amino acid sequences from each protein were generated by summing
the masses, using average masses for the amino acids, of the
linear amino acid sequence from the amino terminus (n). If the
mass of the linear sequence exceeded the mass of the unknown
peptide, then the algorithm returned to the amino terminal amino
acid and began summing amino acid masses from the n+1 position.
This process was repeated until every linear amino acid sequence
combination had been evaluated. When the mass of the amino acid
sequence was within ±0.05% (minimum of ±1 Da) of the mass of the
unknown peptide, the predicted m/z values for the type b- and
y-ions were generated and compared to the fragment ions of the
unknown sequence. A preliminary score (Sp) was generated and the
top 300 candidate peptide sequences with the highest preliminary
score were ranked and stored. A final analysis of the top 300
candidate amino acid sequences was performed with a correlation
function. Using this function a theoretical MS/MS spectrum for
the candidate sequence was compared to the modified experimental
MS/MS spectrum. Correlation coefficients were calculated, ranked
and reported. The final results were ranked on the basis of the
normalized correlation coefficient.
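
The mass tolerance and the cutoff at 300 candidates described above can be expressed compactly in Python. The function names are hypothetical, and the use of `heapq.nlargest` to keep the best-scoring candidates is an implementation choice made for this sketch rather than something stated in the text.

```python
import heapq


def within_tolerance(candidate_mass, peptide_mass):
    """True when the masses agree within +/-0.05%, with a minimum window of +/-1 Da."""
    tolerance = max(0.0005 * peptide_mass, 1.0)
    return abs(candidate_mass - peptide_mass) <= tolerance


def keep_top(scored_candidates, limit=300):
    """Keep the highest-scoring candidates.

    scored_candidates: iterable of (preliminary_score, sequence) pairs.
    """
    return heapq.nlargest(limit, scored_candidates, key=lambda pair: pair[0])


best_two = keep_top([(5.0, "AAA"), (9.0, "CCC"), (7.0, "GGG")], limit=2)
```

The percentage term only takes over above 2000 Da; below that mass the fixed ±1 Da minimum is the wider window.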

The spectrum shown in Fig. 5 was obtained by LC-MS/MS analysis of
a peptide bound to a DRB*0401 MHC class II complex. A search of
the genpept database containing 74,938 protein sequences
identified 384,398 peptides within a mass tolerance of ±0.05%
(minimum of ±1 Da) of the molecular weight of this peptide. By
comparing fragment ion patterns predicted for each of these amino
acid sequences to the pre-processed MS/MS spectra and calculating
a preliminary score, the number of candidate sequences was cut
off at 300. A correlation analysis was then performed with the
predicted MS/MS spectra for each of these sequences and the
modified experimental MS/MS spectrum. The results of the search
through the genpept database with the spectrum in Fig. 5 are
displayed in Table 1. Two peptides of similar sequence,
DLRSWTAADAAQISK [Seq. ID No. 1] and DLRSWTAADAAQISQ [Seq. ID No.
2], were identified as the highest scoring sequences (Cn values).
Their correlation coefficients are identical so their rankings in
Table 1 are arbitrary. The amino acid sequence DLRSWTAADAAQISK
[Seq. ID No. 1] occurs in five proteins in the genpept database
while the sequence DLRSWTAADAAQISQ [Seq. ID No. 2] occurs in only
one. The top three sequences appear in immunologically related
proteins and the rest of the proteins appear to have no
correlation to one another. A second search using the same MS/MS
spectrum was performed with the Homo sapiens subset of the
genpept database to compare the results. These data are
presented in Table 2. In both searches the correct sequence tied
for the top position. Both amino acid sequences have identical
correlation coefficients, Cn, although the sequences differ by
Lys and Gln at the C-terminus. These two amino acids have the
same nominal mass and would be expected to produce similar MS/MS
spectra. The sums of the normalized fragment ion intensities,
Im, for the matched fragment ions of the two peptides are
different, with the correct sequence having the greater value.
The correct sequence also matched an additional fragment ion in
the preliminary scoring procedure, identifying 70% of the
predicted fragment ions for this amino acid sequence in the
pre-processed spectrum. These matches are determined as part of
the preliminary scoring procedure.

Example #2

To examine the complexity of the mixture of peptides obtained by
proteolysis of the total proteins from S. cerevisiae cells, 10^8
cells were grown and harvested. After lysis, the total proteins
were contained in ~9 mL of solution. A 0.5 mL aliquot was
removed for proteolysis with the enzyme trypsin. From this
solution two microliters were directly injected onto a micro-LC
(liquid chromatography) column for MS analysis. In a complex
mixture of peptides it is conceivable that multiple peptide ions
may exist at the same m/z and contribute to increased background,
complicating MS/MS analysis and interpretation. To test the
ability to obtain sequence information by MS/MS from these
complex mixtures of peptides, ions from the mixture were selected
with on-line MS/MS analysis. In no case were the spectra
contaminated with fragment ions from other peptides. A partial
list of the identified sequences is presented in Table 3.



Table 3

S. cerevisiae Protein              Seq. ID No.   Amino Acid Sequence
enolase                            3
hypusine containing protein HP2    4             APEGELGDSLQTAFDEGK
phosphoglycerate kinase            5             TGGGASLELLEGK
BMH1 gene product                  6             QAFDDAIAELDTLSEESYK
pyruvate kinase                    7             IPAGWQGLDNGPSER
phosphoglycerate kinase            8             LPGTDVDLPALSEK
hexokinase                         9             IEDDPFENLEDTDDDPQK
enolase                            10            EEALDLIVDAIK
enolase                            11            NPTVEVELTTEK



The MS/MS spectra presented in Table 1 were interpreted using the
described database searching method. This method serves as a
data pre-filter to match MS/MS spectra to previously determined
amino acid sequences. Pre-filtering the data allows
interpretation efforts to be focused on previously unknown amino
acid sequences. Results for some of the MS/MS spectra are shown
in Table 4. No pre-assigning of sequence ions or manual
interpretation is required prior to the search. However, the
sequences must exist in the database. The algorithm first
pre-processed the MS/MS data and then compared all the amino acid
sequences in the database within ±1 amu of the mass of the
precursor ion of the MS/MS spectrum. The predicted fragmentation
patterns of the amino acid sequences within the mass tolerance
were compared to the experimental spectrum. Once an amino acid
sequence was within this mass tolerance, a final closeness-of-fit
measure was obtained by reconstructing the MS/MS spectra and
performing a correlation analysis to the experimental spectrum.
Table 4 lists a number of spectra used to test the efficacy of
the algorithm.

The computer program described above has been modified to analyze
the MS/MS spectra of phosphorylated peptides. In this algorithm
all types of phosphorylation are considered such as Thr, Ser, and
Tyr. Amino acid sequences are scanned in the database to find
linear stretches of sequence that are multiples of 80 amu below
the mass of the peptide under analysis. In the analysis each
putative site of phosphorylation is considered and attempts to
fit a reconstructed MS/MS spectrum to the experimental spectrum
are made.
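
The phosphorylation search can be sketched in Python by testing whether a candidate sequence lies a multiple of 80 amu below the measured mass and then listing every placement of that many phosphate groups on Ser, Thr and Tyr residues. The function name is hypothetical, the residue masses are standard average values, and comparing neutral masses with a ±1 amu tolerance is an assumption.

```python
from itertools import combinations

# Standard average residue masses (Da).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153
PHOSPHATE_SHIFT = 80.0  # mass added per phosphorylation, as stated in the text


def phosphorylation_assignments(sequence, peptide_mass, tolerance=1.0):
    """List the site combinations that would bring the sequence to peptide_mass.

    Each assignment is a tuple of 0-based positions of phosphorylated residues.
    """
    unmodified = WATER + sum(AVERAGE_RESIDUE_MASS[r] for r in sequence)
    sites = [i for i, r in enumerate(sequence) if r in "STY"]
    assignments = []
    for n_phospho in range(1, len(sites) + 1):
        modified = unmodified + n_phospho * PHOSPHATE_SHIFT
        if abs(modified - peptide_mass) <= tolerance:
            assignments.extend(combinations(sites, n_phospho))
    return assignments


singly = phosphorylation_assignments("ASTAY", peptide_mass=591.53)
doubly = phosphorylation_assignments("ASTAY", peptide_mass=671.53)
```

Each returned assignment would then be used to build its own reconstructed spectrum for comparison with the experimental spectrum.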

Table 4

List of results obtained searching genpept and species specific
databases using MS/MS spectra for the respective peptides.

                 Amino Acid Sequence of           Seq. ID   Genpept    Genpept     Species
No.   Mass       Peptides used in the Search      No.       Database   Database^   Specific
1     1734.5     DLRSWTAADTAAQISQ                 12        1          1           1
2     1749       DLRSWTAADTAAQITQ                 13        1          1           1
3     1186.5     MATPLLMQALP                      14                               13
4     1317.7     MATPLLMQALP^                     14                   61          17
5     1571.6     EGVNDNEEGFFSAR^                  15        1*         1           1
6     1571.6     EGVNDNEEGFFSAR^                  15        1*         1           1
7     1297.5     DRVYIHPFHL (+2)                  16        1          1           1
8     1297.5     DRVYIHPFHL (+2)                  16        2          2           2
9     1297.5     DRVYIHPFHL (+3)                  16        1          1           1
10    1593.8     VEADVAGHGQDILIR^                 17        1          1           1
11    1393.7     HGVTVLTALGAILK^                  18        1          1           1
12    1741.8     HSGQAEGYSYTDANI                  19        1          1           1
13    848.8      HSGQAEGY^ (+1)                   20        1          1           1
14    723.9      MAFGGLK^                         21
15    636.8      GATLFK (+1) (QATLFG, KTLFK)      22                               6
16    524.6      TEFK (+1)^                       23        ii                     5
17    1251.4     DRNDLLTYLK^                      24                   5           1
18    1194.4     VLVLDTDYKK^                      25        I-         6           2
19    700.7      CRGDSY (CGRDSY)                  26                   1           1
20    700.7      CRGDSY^ (+1)                     26                               7
21    764.9      KGATLFK^                         27        3          3           1
22    1169.3     TGPNLHGLFGR                      28        1          1           1
23    1047.2     DRVYIHPF                         29                               7
24    1139.3     TLLVGESATTF (+1)                 30        1          1           1
25    1189.4     RNVIPDSKY                        31        1          1           1
26    613.7      SSPLPL (+1)                      32        2          4           2
27    1323.5     LARNCQPNYW (C=161.17)            33        1          1           1
28    2496.7     AQSMGFINEDLSTSAQALMSDW           34        1          1           1
29    1551.8     VTLIHPIAMDDGLR                   35        3          3           1
30    1803.0     GGDTVTLNETDLTQIPK                36        2          2           1
31    1172.4     VGEEVEIVGIK                      37        1          1           1
32    2148.5     GWQVPAFTLGGEATDIWMR              38        1          1           1
33    2553.9     VASISLPTSCASAGTQCLISGWGNTK^      39                   1           1
34    1154.3     SSGTSYPDVLK^                     40                   3           1
35    1174.5     TLNNDIMLIK                       41        1          1           1
36    2274.6     SIVHPSYNSNTLNNDIMLIK^            42                   2           1

* not present in the genpept database
^ sequence appended to the human database, not originally in human
database
^ amino acid sequences added to database
not in the top 100 answers
peptide of similar sequence identified



Example #3 

Much of the information generated by the genome projects will be
in the form of nucleotide sequences. Those stretches of
nucleotide sequence that can be correlated to a gene will be
translated to a protein sequence and stored in a specific
database (genpept). The un-translated nucleotide sequences
represent a wealth of data that may be relevant to protein
sequences. The present invention will allow searching the
nucleotide database in the same manner as the protein sequence
databases. The procedure will involve the same algorithmic
approach of cycling through the nucleotide sequence. The
three-base codon will be converted to a protein sequence and the
mass of the amino acids summed. To cycle through the nucleotide
sequence, a one-base increment will be used for each cycle. This
will allow the determination of an amino acid sequence for each
of the three reading frames in one pass. For example, if an
MS/MS spectrum is generated for the sequence
Asp-Leu-Arg-Ser-Trp-Thr-Ala [Seq. ID No. 43] ((M+H)+ = 848), the
algorithm will search the nucleotide sequence in the following
manner.



Nucleotide sequence from the database.
                                                    Mass   Seq. ID No.
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC              44

First pass through the sequence.
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC              44
amino acids   Ala Ile Ser Gly Leu Gly Leu Leu       743    45

Second pass through the sequence.
nucleotides   G CGA UCU CCG GUC UUG GAC UGC UC             44
amino acids     Arg Ser Pro Val Leu Gly Leu         741    46

Third pass through the sequence.
nucleotides   GC GAU CUC CGG UCU UGG ACU GCU C             44
amino acids      Asp Leu Arg Ser Trp Thr Ala        848    43

Fourth pass through the sequence.
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC              44
amino acids       Ile Ser Gly Leu Gly Leu Leu       672    45



As the sequence of amino acids matches the mass of the peptide,
the predicted sequence ions will be compared to the MS/MS
spectrum. From this point on the scoring and reporting
procedures for the search will be the same as for a protein
sequence database.
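
The one-base-increment translation can be sketched in Python as follows. The codon table is the standard genetic code written in compact form, the function name is hypothetical, and the input is the example nucleotide sequence taken with AUC as its second codon (an assumption consistent with the Ile in the first-pass translation).

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order for each base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate_frames(rna):
    """Translate the three forward reading frames, shifting one base per frame."""
    frames = []
    for offset in range(3):
        # Take complete codons only; trailing bases are ignored.
        codons = [rna[p:p + 3] for p in range(offset, len(rna) - 2, 3)]
        frames.append("".join(CODON_TABLE[codon] for codon in codons))
    return frames


frames = translate_frames("GCGAUCUCCGGUCUUGGACUGCUC")
```

With this input the third frame, `frames[2]`, reads DLRSWTA, the one-letter form of Asp-Leu-Arg-Ser-Trp-Thr-Ala. Each translated frame can then be passed to the same mass-summing search used for protein sequences.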

In light of the above description, a number of advantages of the
present invention can be seen. The present invention permits
correlating mass spectra of a protein, peptide or oligonucleotide
with a nucleotide or protein sequence database in a fashion which
is relatively accurate, rapid, and which is amenable to
automation (i.e., to operation without the need for the exercise
of human judgment). The present invention can be used to analyze
peptides which are derived from a mixture of proteins and thus is
not limited to analysis of intact homogeneous proteins such as
those generated by specific and known proteolytic cleavage.

A number of variations and modifications of this invention can
also be used. The invention can be used in connection with a
number of different proteins or peptide sources and it is
believed applicable to any analysis using mass spectrometry and
proteins. In addition to the examples described above, the
present invention can be used, for example, for monitoring
fermentation processes by collecting cells, lysing the cells to
obtain the proteins, digesting the proteins, e.g. in an enzyme
reactor, and analyzing by mass spectrometry as noted above. In
this example, the data could be interpreted using a search of the
organism's database (e.g., a yeast database). As another
example, the invention could be used to determine the species of
organism from which a protein is obtained. The analysis would
use a set of peptides derived from digestion of the total
proteins. Thus, the cells from the organism would be lysed, the
proteins collected and digested. Mass spectrometry data would be
collected with the most abundant peptides. A collection of
spectra (e.g., 5 to 10 spectra) would be used to search the
entire database. The spectra should match known proteins of the
species. Since this method would use the most abundant proteins
in the cell, it is believed that there is a high likelihood the
sequences for these organisms would be sequenced and in the
database. In one embodiment, relatively few cells could be used
for the analysis (e.g., on the order of 10^ - 10^).

For example, methods of the invention can be used to identify
microorganisms, cell surface proteins and the like. For
identifying microorganisms, the procedure can employ tandem mass
spectra obtained from peptides produced by proteolytic digestion
of the cellular proteins. The complex mixture of peptides
produced is subjected to separation by HPLC on-line to a tandem
mass spectrometer. As peptides elute off the column, tandem mass
spectra are obtained by selecting a peptide ion in the first mass
analyzer, sending it into a collision cell, and recording the
mass-to-charge (m/z) ratios of the resulting fragment ions in the
second mass analyzer. This process is performed over the course
of the HPLC analysis and produces a large collection of spectra
(e.g., from 10 to 200 or more). Each spectrum represents a
peptide derived from the microorganism's protein (gene) pool and
thus the collection can be used to develop one or more family,
genus, species, serotype or strain-specific markers of the
microorganism, as desired.

The identification of the microorganism is performed using one of
at least three software related techniques. In a first
technique, a database search, the tandem mass spectra are used to
search protein and nucleotide databases to identify an amino acid
sequence which is represented by the spectrum. Identification of
the organism is achieved when a preponderance of spectra obtained
in the mass spectrometry analysis match to proteins previously
identified as coming from a particular organism. Means for
searching databases in this fashion are as described hereinabove.

In a second technique a library search can be performed, such as
if no solid matches are observed using the database search
described above. In this approach the data set is compared to a
pre-defined library of spectra obtained from known organisms.
Thus, initially a library of peptide spectra is created from
known microorganisms. The library of tandem mass spectra for
microorganisms can be constructed by any of several methods which
employ LC-MS/MS. The methods can be used to vary the location
from which cellular proteins are obtained, and the amount of
pre-purification employed for the resulting peptide mixture prior
to LC-MS/MS analysis. For example, intact cells can be treated
with a proteolytic enzyme such as trypsin, chymotrypsin,
endoproteinase Glu-C, endoproteinase Lys-C, pepsin, etc. to
digest the proteins exposed on the cell surface. Pre-treatment
of the intact cells with one or more glycosidases can be used to
remove steric interference that may be created by the presence of
carbohydrates on the cell surface. Thus, the pre-treatment with
glycosidases may be used to obtain higher peptide yields during
the proteolysis step. A second method to prepare peptides
involves rupturing the cell membranes (e.g., by sonication,
hypo-osmotic shock, freeze-thawing, glass beads, etc.) and
collecting the total proteins by precipitation, e.g., using
acetone or the like. The proteins are resuspended in a digestion
buffer and treated with a protease such as trypsin, chymotrypsin,
endoproteinase Glu-C, endoproteinase Lys-C, etc. to create a
mixture of peptides. Partial simplification of this mixture of
peptides, such as by partitioning the mixture into acid and basic
fractions or by separation using strong cation exchange
chromatography, leads to several pools of peptides which can then
be used in the mass spectrometry process. The peptide mixtures
are analyzed by LC-MS/MS, creating a large set of spectra, each
representing a unique peptide marker of the organism or cell
type.

The data are stored in the library in any of a variety of means,
but conveniently in three sections, wherein one section is the
peptide mass determined from the spectrum, a second section is
information related to the organism, species, growth conditions,
etc., and a third section contains the mass/intensity data. The
data can be stored in a variety of formats, conveniently an ASCII
format or in a binary format.

To perform the library search, spectra are compared by first
determining whether the mass of the peptide is within a preset
mass tolerance (typically about ± 1-3 amu) of the library
spectrum; a cross-correlation function as described hereinabove
is then used to obtain a quantitative value of the similarity or
closeness-of-fit of the two spectra. The process is similar to
the database searching algorithm except a spectrum is not
reconstructed for the amino acid sequence. To provide a set of
comparison spectra the tandem mass spectrum can be used to search
a small (e.g., ~100 protein sequences) randomly generated
sequence database. This provides a background against which
similarity is compared and from which a normalized score is
generated.
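
A library search of this kind can be sketched in Python with library entries holding the three sections described above (peptide mass, organism information, mass/intensity data). The entry layout, the function names and the ±2 amu default tolerance are assumptions, and a normalized dot product is used here as a simple stand-in for the cross-correlation function referred to in the text.

```python
import math


def similarity(spectrum_a, spectrum_b):
    """Normalized dot product of two unit-mass spectra (dicts of mass -> intensity)."""
    dot = sum(i * spectrum_b.get(m, 0.0) for m, i in spectrum_a.items())
    norm_a = math.sqrt(sum(i * i for i in spectrum_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spectrum_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_library(peptide_mass, spectrum, library, mass_tolerance=2.0):
    """Score library entries within the mass tolerance, best match first."""
    hits = [
        (similarity(spectrum, entry["spectrum"]), entry["organism"])
        for entry in library
        if abs(entry["peptide_mass"] - peptide_mass) <= mass_tolerance
    ]
    return sorted(hits, reverse=True)


# Hypothetical library entries, for illustration only.
library = [
    {"peptide_mass": 848.9, "organism": "organism A", "spectrum": {175: 50.0, 246: 100.0}},
    {"peptide_mass": 849.5, "organism": "organism B", "spectrum": {175: 100.0, 500: 20.0}},
    {"peptide_mass": 1200.0, "organism": "organism C", "spectrum": {175: 50.0, 246: 100.0}},
]
ranked = search_library(848.8, {175: 50.0, 246: 100.0}, library)
```

The mass filter runs first, so the third entry is never scored even though its spectrum is identical to the query; only entries that pass the filter are ranked by similarity.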

A third related technique for identifying a microorganism or cell
involves de novo interpretation to determine a set of amino acid
sequences that have the same mass as the peptide represented by
the spectrum. The set of amino acid sequences is limited by
using the spectral pre-processing equation 1, above, to rank the
sequences. This set of amino acid sequences then serves as the
database for use in the searching method described hereinabove.
An amino acid sequence is thereby derived for a tandem mass
spectrum that is not contained in the organized databases. By
using phylogenetic analysis of the determined amino acid
sequences they can be placed within a species, genus or family
and a classification of the microorganism is thereby
accomplished.

The methodology described above has applications in addition to
identifying microorganisms. For example, cDNA sequencing can be
carried out using conventional means to obtain partial sequences
of genes expressed in particular cell lines, tissue types or
microorganisms. This information then serves as the database for
the subsequent analyses. The approach described above for
digesting the proteins exposed on the cell surface by enzymatic
digestion can be used to generate a collection of peptides for
LC-MS/MS analysis. The resulting spectra are used to search the
nucleotide sequences in all 6 reading frames to match amino acid
sequences to the MS/MS spectra. The amino acid sequences
identified represent regions of the cell surface proteins exposed
to the extracellular space. This method provides at least two
additional pieces of information not directly obtainable from
cDNA sequencing. First, the spectra identify the proteins
residing on the membrane of the cells. Secondly, sidedness
information is obtained about the folding of the proteins on the
cell surface. The peptide sequences matched to the nucleotide
sequence information identify those segments of the protein
sequence exposed extracellularly.

The methods can also be used to interpret the MS/MS spectra of
carbohydrates. In this method the carbohydrate(s) of interest is
subjected to separation by HPLC on-line to a tandem mass
spectrometer as with the peptides. The carbohydrates can be
obtained from a complex mixture of carbohydrates or obtained from
proteins, cells, etc. by chemical or enzymatic release. Tandem
mass spectra are obtained by selecting a carbohydrate ion in the
first mass analyzer, sending it into a collision cell, and
recording the mass-to-charge (m/z) ratios of the resulting
fragment ions in the second mass analyzer. This process is
performed over the course of the HPLC analysis and produces a
large collection of spectra (e.g., from 10 to 200 or more). The
fragmentation patterns of the carbohydrate structures contained
in the database can be predicted and a theoretical representation
of the spectra can be compared to the pattern in the tandem mass
spectrum by using the method described hereinabove. The
carbohydrate structures analyzed by tandem mass spectrometry can
thereby be identified. These methods can thus be used for
characterization of the carbohydrate structures found on
proteins, cell surfaces, etc.

The present invention can be used in connection with diagnostic
applications, such as described above and in Example 2. Another
example involves identifying virally infected cells. Success of
such an approach is believed to depend on the relative abundance
of the viral proteins versus the cellular proteins, at least
using present equipment. If an antibody were produced to a
specific region of a protein common to certain pathogens, the
mixture of proteins could be partially fractionated by passing
the material over an immunoaffinity column. Bound proteins are
eluted and digested. Mass spectrometry generates the data to
search a database. One important element is finding a general
handle to pull proteins from the cell. This approach could also
be used to analyze specific diagnostic proteins. For example, if
a certain protein variant is known to be present in some form of
cancer or genetic disease, an antibody could be produced to a
region of the protein that does not change. An immunoaffinity
column could be constructed with the antibody to isolate the
protein away from all the other cellular proteins. The protein
would be digested and analyzed by tandem mass spectrometry. The
database of all the possible mutations in the protein could be
maintained and the experimental data analyzed against this
database.

One possible example would be cystic fibrosis. This
disease is characterized by multiple mutations in the CFTR
protein. One mutation is responsible for about 70% of the
cases and the other 30% of the cases result from a wide
variety of mutations. To analyze these mutations by genetic
testing would require many different analyses and probes. In
the assay described above, the protein would be isolated and
analyzed by tandem mass spectrometry. All the mutations in
the protein could be identified in an assay based on
structural information. The database used would preferably
describe all the known mutations. Implementation of this
approach is believed to involve significant difficulties. The
amount of protein required could be so large that it would be
impractical to obtain from a patient. This problem may be
overcome as the sensitivity of mass spectrometry improves in
the future. In addition, CFTR is a transmembrane protein, and
such proteins are typically very difficult to manipulate and
digest. The approach described could be used for any
diagnostic protein. The data would be highly specific and the
data analysis essentially automated.
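
By way of illustration only, a database of known mutations of
the kind described above might be assembled by applying each
catalogued mutation to a reference sequence, as in the
following Python sketch. The function names, the mutation
notation, and the short reference sequence are hypothetical
and chosen for the example; they are not CFTR data and are not
taken from the described embodiments.

```python
def apply_mutation(reference, mutation):
    """Return the variant sequence produced by one mutation.

    A mutation is ("sub", position, new_residue) or ("del", position).
    Positions are 1-based, as is usual for protein residues.
    """
    kind, position = mutation[0], mutation[1]
    i = position - 1
    if not 0 <= i < len(reference):
        raise ValueError(f"position {position} is outside the sequence")
    if kind == "sub":
        return reference[:i] + mutation[2] + reference[i + 1:]
    if kind == "del":
        return reference[:i] + reference[i + 1:]
    raise ValueError(f"unknown mutation type: {kind}")


def build_variant_database(reference, known_mutations):
    """Map each variant name to its sequence, including the reference."""
    database = {"reference": reference}
    for name, mutation in known_mutations.items():
        database[name] = apply_mutation(reference, mutation)
    return database


# Hypothetical toy data, for illustration only.
REFERENCE = "MKTAYIAKQR"
KNOWN_MUTATIONS = {
    "A4V": ("sub", 4, "V"),    # substitution at residue 4
    "delI6": ("del", 6),       # deletion of residue 6
}
VARIANTS = build_variant_database(REFERENCE, KNOWN_MUTATIONS)
```

Each sequence in such a table could then be searched against
the experimental spectra in the same manner as the entries of
any other sequence database.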

It is believed that the present invention can be
used with any size peptide. The process requires that
peptides be fragmented and there are methods for achieving
fragmentation of very large proteins. Some such techniques
are described in Smith et al., "Collisional Activation and
Collision-Activated Dissociation of Large Multiply Charged
Polypeptides and Proteins Produced by Electrospray Ionization"
J. Amer. Soc. Mass Spect. 1:53-65 (1990). The present method
can be used to analyze data derived from intact proteins, in
that there is no theoretical or absolute practical limit to
the size of peptides that can be analyzed according to this
invention. Analysis using the present invention has been
performed on peptides at least in the size range from about
400 amu (4 residues) to about 2500 amu (26 residues).

In described embodiments candidate sub-sequences are
identified and fragment spectra are predicted as they are
needed, at the time of doing the analysis. If sufficient
computational resources and storage facilities are available
to perform some or all of the calculations needed for
candidate sequence identification (such as calculation of
sub-sequence masses) and/or spectra prediction (such as
calculation of fragment masses), storage of these items in a
database can be employed so that some or all of these items
can be looked up rather than calculated each time they are
needed.
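
As one illustration of the look-up alternative, the following
Python sketch precomputes the masses of all sub-sequences of a
set of sequences once and stores them in a sorted table, so
that candidates within a mass tolerance are found by look-up
rather than by recalculation. The residue masses are standard
monoisotopic values; the function names, the in-memory table
standing in for the database, the max_len limit, and the toy
sequence are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# Standard monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056  # added once per peptide for the terminal H and OH


def build_mass_index(sequences, max_len=30):
    """Precompute the mass of every sub-sequence up to max_len residues.

    Returns two parallel lists sorted by mass: the masses, and the
    matching (sequence id, start, end) entries.
    """
    rows = []
    for seq_id, seq in sequences.items():
        for start in range(len(seq)):
            mass = WATER
            stop = min(len(seq), start + max_len)
            for end in range(start + 1, stop + 1):
                mass += RESIDUE_MASS[seq[end - 1]]
                rows.append((mass, seq_id, start, end))
    rows.sort()
    masses = [row[0] for row in rows]
    entries = [row[1:] for row in rows]
    return masses, entries


def lookup_candidates(masses, entries, peptide_mass, tolerance):
    """Look up stored sub-sequences whose mass is within the tolerance."""
    lo = bisect_left(masses, peptide_mass - tolerance)
    hi = bisect_right(masses, peptide_mass + tolerance)
    return entries[lo:hi]
```

The table is built once; each later query is a binary search
over the stored masses, trading storage for computation as
described above.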

While the present invention has been described by 
way of the preferred embodiment and certain variations and 
modifications, other variations and modifications of the 
present invention can also be used, the invention being 
described by the following claims.

WHAT IS CLAIMED IS:

1. A method for correlating a peptide fragment
mass spectrum with amino acid sequences derived from a
database of sequences, comprising:
storing data representing a first mass spectrum of a
plurality of fragments of at least a first peptide;
calculating a plurality of predicted mass spectra of
at least a portion of a plurality of said sequences in said
database of sequences; and
calculating at least a first measure for each of
said plurality of predicted mass spectra, said first measure
being an indication of the closeness-of-fit between said first
mass spectrum and each of said plurality of mass spectra.

2. A method, as claimed in claim 1, wherein said
first mass spectrum is provided from a tandem mass
spectrometer device.

3. A method, as claimed in claim 2, wherein the
tandem mass spectrometer is one of a triple quadrupole mass
spectrometer, a Fourier-transform cyclotron resonance mass
spectrometer, a tandem time-of-flight mass spectrometer and a
quadrupole ion trap mass spectrometer.

4. A method, as claimed in claim 1, wherein said
database of sequences is a database of amino acid sequences of
a plurality of proteins.

5. A method, as claimed in claim 1, wherein said
database of sequences is a nucleotide database.

6. A method, as claimed in claim 1, further
comprising selecting a first plurality of sub-sequences from
said database of sequences, wherein said step of calculating a
plurality of predicted mass spectra includes calculating at
least one predicted mass spectrum for each of said selected
first plurality of sub-sequences.

7. A method, as claimed in claim 1, wherein said
step of calculating a first measure includes selecting those
values from said first mass spectrum having an intensity
greater than a predetermined threshold.

8. A method, as claimed in claim 1, further
comprising normalizing said first spectrum prior to said step
of calculating at least a first measure.
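
By way of illustration only, the thresholding and normalizing
steps recited in claims 7 and 8 might be sketched in Python as
follows. The claims do not fix a normalization convention;
scaling the most intense peak to 100 is an assumption made for
the example, as are the function names and the representation
of a spectrum as a list of (m/z, intensity) pairs.

```python
def normalize(spectrum, top=100.0):
    """Scale intensities so the most intense peak equals `top`.

    `spectrum` is a list of (m/z, intensity) pairs. The choice of
    scaling to the base peak is an assumption for this sketch.
    """
    if not spectrum:
        return []
    peak = max(intensity for _, intensity in spectrum)
    if peak <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [(mz, intensity * top / peak) for mz, intensity in spectrum]


def select_above_threshold(spectrum, threshold):
    """Keep only values whose intensity is greater than the threshold."""
    return [(mz, intensity) for mz, intensity in spectrum
            if intensity > threshold]


# Hypothetical three-peak spectrum, for illustration only.
raw = [(100.0, 5.0), (200.0, 40.0), (300.0, 20.0)]
normalized = normalize(raw)
kept = select_above_threshold(normalized, 15.0)
```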

9. A method, as claimed in claim 1, wherein said
step of calculating a plurality of predicted mass spectra
includes calculating predicted mass spectra for only a portion
of said sequence database.

10. A method, as claimed in claim 9, wherein said
first peptide is derived from a protein which is obtained from
a first organism and wherein said portion of said sequence
database is the portion containing sequences for proteins
found in said first organism.

11. A method, as claimed in claim 2, wherein a first
mass spectrometer of said tandem mass spectrometer device is
used to separate-out a component having a first mass, an
activation device of said mass spectrometer device is used to
fragment said first component and a second mass spectrometer
of said tandem mass spectrometer device is used to provide said
first mass spectrum.

12. A method, as claimed in claim 1, wherein said
first peptide is isolated by chromatography.

13. A method, as claimed in claim 1, wherein said
data representing said first mass spectrum includes a
plurality of mass-charge pairs.

14. A method, as claimed in claim 1, wherein said
step of calculating predicted mass spectra comprises:
deriving a plurality of masses from portions of said
plurality of sequences, each mass equal to the mass of a
peptide fragment which corresponds to a portion of a sequence
in said plurality of sequences;
selecting those masses, among said plurality of
masses, which are within a predetermined mass tolerance of the
mass of said first peptide and storing an indication of which
portion of which sequence each of said selected masses
corresponds to, to provide a plurality of candidate sequence
portions; and
calculating a plurality of mass-charge pairs for
each of said candidate sequence portions, each of said
mass-charge pairs having a mass substantially equal to the mass of
a peptide fragment corresponding to a portion of one of said
candidate sequence portions.
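
By way of illustration only, the steps recited in claim 14
might be sketched in Python as follows: masses are derived for
portions of each sequence, those within a tolerance of the
peptide mass are kept as candidate portions, and mass-charge
pairs are then calculated for a candidate. The residue masses
are standard monoisotopic values. The restriction to singly
charged b-type and y-type fragment ions, the function names,
and the toy sequence are assumptions made for the example and
are not recited in the claim.

```python
# Standard monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056   # terminal H and OH of an intact peptide
PROTON = 1.00728   # charge carrier for a singly charged ion


def candidate_portions(sequences, target_mass, tolerance):
    """List (sequence id, start, end) portions within the mass tolerance."""
    candidates = []
    for seq_id, seq in sequences.items():
        for start in range(len(seq)):
            mass = WATER
            for end in range(start + 1, len(seq) + 1):
                mass += RESIDUE_MASS[seq[end - 1]]
                if mass > target_mass + tolerance:
                    break  # extending the portion only adds mass
                if abs(mass - target_mass) <= tolerance:
                    candidates.append((seq_id, start, end))
    return candidates


def fragment_mz(peptide):
    """Singly charged b- and y-ion (label, m/z) pairs (assumed ion types)."""
    pairs = []
    n = len(peptide)
    for i in range(1, n):
        b = sum(RESIDUE_MASS[r] for r in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[r] for r in peptide[i:]) + WATER + PROTON
        pairs.append((f"b{i}", b))
        pairs.append((f"y{n - i}", y))
    return pairs
```

For example, candidate_portions({"toy": "GASPVTLK"}, 301.1638,
0.01) selects the single portion spanning residues S-P-V, and
fragment_mz("SPV") then yields the b1, y2, b2 and y1 values
for that candidate.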

15. A method, as claimed in claim 1, wherein said
first measure comprises a correlation coefficient.

16. A method, as claimed in claim 1, wherein said
step of calculating a first measure comprises:
calculating a preliminary score for each of said
plurality of candidate sequence portions;
identifying a plurality of primary candidate
portions which have a preliminary score which is greater than
at least one candidate sequence which is not identified as a
primary candidate portion; and
calculating a correlation coefficient for each of
said primary candidate portions.

17. A method, as claimed in claim 8, wherein each
of said plurality of mass spectra and said first mass spectrum
includes a plurality of mass-charge pairs, each mass-charge
pair having an intensity value, and further comprising the
step of identifying, for each of said plurality of mass
spectra, a set of matched fragments which have less than a
predetermined difference from corresponding fragments in said
first mass spectrum; and
wherein said preliminary score is the number of
fragments of a predicted mass spectrum in said set of matched
fragments multiplied by the sum of the intensity values for
the mass-charge pairs corresponding to said matched fragments.
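
By way of illustration only, the preliminary score recited in
claim 17 (the number of matched fragments multiplied by the
sum of the intensity values of the matched mass-charge pairs)
might be computed as in the following Python sketch. The
function name, the default tolerance, and the rule of taking
the most intense observed value when several fall within the
tolerance are assumptions made for the example.

```python
def preliminary_score(predicted_mz, observed, tolerance=1.0):
    """Number of matched fragments times the sum of their intensities.

    `predicted_mz` is a list of predicted fragment m/z values and
    `observed` is a list of (m/z, intensity) pairs from the measured
    spectrum. A predicted fragment is matched when an observed value
    differs from it by less than `tolerance`.
    """
    matched = 0
    intensity_sum = 0.0
    for mz in predicted_mz:
        # Assumption: if several observed peaks qualify, use the most intense.
        nearby = [intensity for obs_mz, intensity in observed
                  if abs(obs_mz - mz) < tolerance]
        if nearby:
            matched += 1
            intensity_sum += max(nearby)
    return matched * intensity_sum


# Hypothetical data, for illustration only.
observed = [(88.0, 50.0), (118.1, 100.0), (300.0, 20.0)]
predicted = [88.04, 118.09, 185.09, 215.14]
score = preliminary_score(predicted, observed, tolerance=0.5)
```

In this example two of the four predicted fragments are
matched, with intensities 50 and 100, so the score is 2
multiplied by 150.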

18. A method for interpreting the mass spectrum of
an oligonucleotide comprising:
providing a library of nucleotide sequences;
storing, in a database, a plurality of nucleotide
sub-sequences from said library, said plurality including all
sequences smaller than n-mers;
storing data representing a first mass spectrum of a
plurality of fragments of said oligonucleotide;
calculating predicted mass spectra for each of said
plurality of nucleotide sub-sequences; and
calculating at least a first closeness-of-fit
measure for each of said predicted mass spectra, with respect
to said first mass spectrum.

19. A method, as claimed in claim 18, wherein n is
10.
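
By way of illustration only, the storing step of claim 18
might be sketched in Python as the enumeration of every
distinct contiguous sub-sequence shorter than n nucleotides
found in a library. Reading "smaller than n-mers" as a length
strictly less than n, the function name, and the use of an
in-memory set in place of a database are assumptions made for
the example.

```python
def subsequences_shorter_than(library, n):
    """Collect all distinct contiguous sub-sequences of length < n."""
    found = set()
    for seq in library:
        for length in range(1, n):
            for start in range(len(seq) - length + 1):
                found.add(seq[start:start + length])
    return found


# Hypothetical two-entry library, for illustration only.
stored = subsequences_shorter_than(["ACGT", "CGTA"], 3)
```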

20. A method for determining whether a peptide in a
mixture of proteins is homologous to a portion of any of a
plurality of proteins specified by an amino acid sequence in a
database of sequences, comprising:
using a tandem mass spectrometer to receive a
plurality of peptides obtained from said mixture of proteins,
to select at least a first peptide from said mixture of
peptides, to fragment said first peptide and to generate a
peptide fragment mass spectrum;
storing data representing said first mass spectrum;
and
correlating said mass spectrum with an amino acid
sequence in said database of sequences, to determine the
correspondence of a protein specified in said sequence
database with a protein in said mixture of proteins.

21. A method, as claimed in claim 20, wherein said
step of correlating includes predicting at least one mass
spectrum from said amino acid sequence.

22. A method according to claim 20 wherein the
mixture of proteins is obtained from a cell or microorganism
to be identified.

23. A method according to claim 22, wherein the
mixture of proteins is obtained by proteolytic digestion of
cellular proteins.

24. The method of claim 23, wherein the cellular
proteins are extracellular.

25. A method for identifying an organism of
interest by determining whether a mass spectrum or a plurality
of mass spectra of peptides obtained from the organism or
components thereof to be identified is contained in a library
of spectra of known organisms, comprising:
using a tandem mass spectrometer to receive a
plurality of peptides obtained from a mixture of proteins
obtained from said organism to be identified, to select at
least a first peptide from said plurality of peptides, to
fragment said first peptide and to generate a peptide fragment
mass spectrum;
storing data representing said first mass spectrum;
and
correlating said mass spectrum with a mass spectrum
in said library of spectra of known organisms to determine the
correspondence of said spectra with the spectra obtained from
peptides obtained from the organism to be identified, thereby
identifying said organism.

26. The method of claim 25, wherein the organism to
be identified is a bacterium, fungus or virus.

27. The method according to claim 25, wherein the
mixture of proteins is obtained by enzymatic digestion of the
organism's proteins.

28. A method for characterizing a carbohydrate
structure of interest from a mixture of carbohydrates,
comprising:
using a tandem mass spectrometer to receive a
plurality of carbohydrates obtained from the mixture of
carbohydrates, to select at least a first carbohydrate ion
from the mixture of carbohydrates in a first mass analyzer of
the tandem mass spectrometer, to fragment said first
carbohydrate and to generate a carbohydrate fragment mass
spectrum;
storing data representing said first mass spectrum;
and
correlating said mass spectrum with a database of
spectra of known carbohydrates, to determine the
correspondence of a carbohydrate specified in said
carbohydrate database with a carbohydrate in said mixture of
carbohydrates, thereby characterizing the structure of the
carbohydrate of interest.

29. The method of claim 28, wherein the mixture of
carbohydrates is obtained from a glycosylated protein of
interest.

30. The method of claim 29, wherein the mixture of
carbohydrates is obtained from a glycosylated protein of
interest by chemical or enzymatic release from the protein.